Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
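For readers who want to experiment with the released checkpoints, the following is a minimal sketch of querying one of the smaller public BLOOM variants through the Hugging Face transformers API; the model ID, prompt, and generation settings here are illustrative choices, not part of the paper (the full 176B model, bigscience/bloom, requires multi-GPU or offloaded inference).

```python
# Minimal sketch: querying a released BLOOM checkpoint via `transformers`.
# "bigscience/bloom-560m" is one of the smaller public variants; the full
# 176B model uses the "bigscience/bloom" ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("BLOOM is a 176B-parameter", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```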
Probing methods allow one to obtain partial representations of linguistic phenomena stored in the inner layers of a neural network, using external classifiers and statistical analysis. Pre-trained Transformer-based language models are widely used both for natural language understanding (NLU) and natural language generation (NLG) tasks, making them the most common choice for downstream applications. However, little analysis has been done of whether these models have been trained enough or contain knowledge correlated with linguistic theory. We introduce a chronological probing study of Transformer English models such as MultiBERT and T5. We sequentially compare the information about the language the models learn in the course of training on corpora. The results show that 1) linguistic information is acquired in the early stages of training, and 2) both language models demonstrate the ability to capture a variety of features from a variety of linguistic levels, including morphology, syntax, and even discourse, while they can also fail inconsistently on tasks considered easy. We also introduce an open-source framework for chronological probing research that is compatible with other Transformer-based models. https://github.com/ekaterinavoloshina/Chronology_probing
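As a rough illustration of the probing methodology described above, the sketch below freezes a pre-trained checkpoint, extracts hidden states from one layer, and fits an external classifier on a toy linguistic task; the model ID, layer index, probe task, and pooling are all illustrative assumptions rather than the paper's actual setup.

```python
# Probing sketch: freeze a checkpoint, extract one layer's hidden states,
# and fit an external classifier on a linguistic labeling task.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

sentences = ["The cats sleep .", "The cat sleeps ."]  # toy probe task:
labels = [1, 0]                                       # plural vs. singular subject

features = []
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt")
        hidden = model(**enc).hidden_states[6]        # probe layer 6
        features.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean-pool tokens

probe = LogisticRegression(max_iter=1000).fit(features, labels)
# Repeating this across successive training checkpoints yields the
# chronological curves the paper analyzes.
print(probe.score(features, labels))
```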
The General QA field has been developing the methodology referencing the Stanford Question answering dataset (SQuAD) as the significant benchmark. However, compiling factual questions is accompanied by time- and labour-consuming annotation, limiting the training data's potential size. We present the WikiOmnia dataset, a new publicly available set of QA-pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and is also tested for creating SQuAD-formatted QA on other domains, like news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
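As a schematic of what such a pipeline produces, the sketch below shows a SQuAD-formatted record together with a strict verification filter; the example QA pair is made up, and the exact-match check is one plausible filter, not the pipeline's actual verification logic.

```python
# Schematic of the dataset's SQuAD-formatted records and a strict
# verification step of the kind applied to the "cleaned" split.
def verify(paragraph: str, qa: dict) -> bool:
    # Strict automatic check: the generated answer must occur verbatim
    # in the source paragraph.
    return qa["answers"][0]["text"] in paragraph

def to_squad_record(title: str, paragraph: str, qa: dict) -> dict:
    # SQuAD-formatted entry, as produced for both raw and cleaned splits.
    return {"title": title,
            "paragraphs": [{"context": paragraph, "qas": [qa]}]}

paragraph = "Москва — столица России."  # toy article summary
qa = {"id": "0",
      "question": "Какой город является столицей России?",
      "answers": [{"text": "Москва", "answer_start": 0}]}
if verify(paragraph, qa):
    record = to_squad_record("Москва", paragraph, qa)
```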
Comparison with human performance is a fundamental requirement for a benchmark to be a reliable measure of model abilities. However, common model-comparison methods may have a fundamental flaw: the arithmetic mean of separate metrics is used across all tasks, despite their differing complexity and the differing sizes of their test and training sets. In this paper, we examine the aggregated scoring methods of popular NLP benchmarks based on their reported results, and re-rank the models using the geometric and harmonic means (which are appropriate for averaging rates). We analyze several popular benchmarks, including GLUE, SuperGLUE, XGLUE, and XTREME. The analysis shows, for example, that human level on SuperGLUE has still not been reached, and current models still have room for improvement.
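To make the re-ranking argument concrete, here is a small sketch (with made-up scores) of how the three aggregation schemes can disagree: a model that is strong on two tasks but weak on a third ties under the arithmetic mean, yet drops below a uniformly average model under the geometric and harmonic means.

```python
# Three aggregation schemes applied to per-task benchmark scores.
from statistics import fmean, geometric_mean, harmonic_mean

scores_model_a = [0.95, 0.90, 0.40]  # strong on two tasks, weak on one
scores_model_b = [0.75, 0.75, 0.75]  # uniformly average

for name, scores in [("A", scores_model_a), ("B", scores_model_b)]:
    print(name,
          round(fmean(scores), 3),           # arithmetic: A=0.75, B=0.75 (tie)
          round(geometric_mean(scores), 3),  # geometric:  A~0.699 < B=0.75
          round(harmonic_mean(scores), 3))   # harmonic:   A~0.643 < B=0.75
```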
Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can vary significantly in domain from the supervised data, which results in pseudo-label quality degradation. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that such pseudo-label analysis and processing yields additional gains on top of the vanilla pseudo-labeling setup, resulting in total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
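As an illustration of the first remedy family, the sketch below filters pseudo-labeled utterances by the model's own confidence; the use of mean per-token log-probability as the score and the threshold value are illustrative assumptions, not the paper's exact criterion.

```python
# Confidence-based pseudo-label filtering (a sketch).
def filter_pseudo_labels(batch, threshold=-0.5):
    """Keep only pseudo-labeled utterances whose average per-token
    log-probability under the model exceeds the threshold."""
    kept = []
    for utt in batch:
        # utt["token_logprobs"]: log-probabilities the model assigned to
        # its own pseudo-label tokens (transcript and/or translation).
        n = max(len(utt["token_logprobs"]), 1)
        avg_lp = sum(utt["token_logprobs"]) / n
        if avg_lp > threshold:
            kept.append(utt)
    return kept
```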
Simulating rigid collisions among arbitrary shapes is notoriously difficult due to complex geometry and the strong non-linearity of the interactions. While graph neural network (GNN)-based models are effective at learning to simulate complex physical dynamics, such as fluids, cloth and articulated bodies, they have been less effective and efficient on rigid-body physics, except with very simple shapes. Existing methods that model collisions through the meshes' nodes are often inaccurate because they struggle when collisions occur on faces far from nodes. Alternative approaches that represent the geometry densely with many particles are prohibitively expensive for complex shapes. Here we introduce the Face Interaction Graph Network (FIGNet) which extends beyond GNN-based methods, and computes interactions between mesh faces, rather than nodes. Compared to learned node- and particle-based methods, FIGNet is around 4x more accurate in simulating complex shape interactions, while also 8x more computationally efficient on sparse, rigid meshes. Moreover, FIGNet can learn frictional dynamics directly from real-world data, and can be more accurate than analytical solvers given modest amounts of training data. FIGNet represents a key step forward in one of the few remaining physical domains which have seen little competition from learned simulators, and offers allied fields such as robotics, graphics and mechanical design a new tool for simulation and model-based planning.
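The core idea of computing interactions between faces rather than nodes can be sketched as follows; the centroid-distance test is a simplified stand-in for FIGNet's actual face-face proximity and message-passing machinery.

```python
# Candidate face-face collision edges for a mesh-based GNN (a sketch).
import numpy as np

def face_interaction_pairs(vertices: np.ndarray,  # (V, 3) positions
                           faces: np.ndarray,     # (F, 3) vertex indices
                           radius: float):
    """Return index pairs (i, j) of distinct faces whose centroids lie
    within `radius`, i.e. candidate collision edges for the GNN."""
    centroids = vertices[faces].mean(axis=1)      # (F, 3)
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (F, F) pairwise distances
    mask = (dist < radius) & ~np.eye(len(faces), dtype=bool)
    i, j = np.nonzero(mask)
    return np.stack([i, j], axis=1)
```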
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets that need to be mimicked by the student model being trained. Interestingly, however, PL strategies generally use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (i.e., soft labels) over sequences as the target for unlabeled data, instead of a single best-pass pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard labels because the training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches can bring the accuracy of soft labels closer to that of hard labels, and while they are not yet able to outperform them, they serve as a useful framework for further improvements.
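The contrast between the two target types can be sketched as per-frame losses in PyTorch; the tensor shapes and the frame-classification framing are illustrative simplifications of the sequence-level training the paper actually uses.

```python
# Hard-label vs. soft-label targets for pseudo-labeling (a sketch).
import torch
import torch.nn.functional as F

logits = torch.randn(32, 100, 500)          # (batch, frames, vocab)
teacher_logits = torch.randn(32, 100, 500)  # teacher pass over same audio

# Hard labels: the argmax of the teacher's distribution, one token per frame.
hard_targets = teacher_logits.argmax(dim=-1)
hard_loss = F.cross_entropy(logits.transpose(1, 2), hard_targets)

# Soft labels: match the full teacher distribution per frame. Without extra
# regularization, this is the variant the paper finds can collapse to a
# degenerate per-frame token distribution.
soft_loss = F.kl_div(F.log_softmax(logits, dim=-1),
                     F.softmax(teacher_logits, dim=-1),
                     reduction="batchmean")
```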
Cancer is one of the leading causes of death worldwide. It is caused by a variety of genetic mutations, which makes every instance of the disease unique. Since chemotherapy can have extremely severe side effects, each patient requires a personalized treatment plan. Finding the dosages that maximize the beneficial effects of the drugs and minimize their adverse side effects is vital. Deep neural networks can automate and improve drug selection, but they require a lot of data to train on; there is therefore a need for machine learning approaches that require less data. Hybrid quantum neural networks have been shown to provide a potential advantage in problems where training data availability is limited. We propose a novel hybrid quantum neural network for drug response prediction based on a combination of convolutional, graph convolutional, and deep quantum neural layers (8 qubits, 363 layers). We test our model on the reduced Genomics of Drug Sensitivity in Cancer dataset and show that the hybrid quantum model outperforms its classical analog by 15% in predicting IC50 drug effectiveness values. The proposed hybrid quantum machine learning model is a step towards deep, data-efficient quantum algorithms with thousands of quantum gates for solving problems in personalized medicine, where data collection is a challenge.
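As a toy illustration of the quantum component, the sketch below builds an 8-qubit variational layer in PennyLane whose expectation values could feed a downstream classical network; the embedding, entangler template, and depth are illustrative, not the paper's 363-layer circuit.

```python
# Toy 8-qubit variational layer for a hybrid quantum-classical model.
import pennylane as qml
from pennylane import numpy as np

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_layer(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))         # encode features
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))  # trainable block
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weights = np.random.uniform(0, np.pi, size=(4, n_qubits))  # 4 layers here
features = np.random.uniform(0, np.pi, size=n_qubits)
print(quantum_layer(features, weights))  # 8 expectation values as outputs
```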
This paper proposes a non-data-driven deep neural network for spectral image recovery problems such as denoising, single hyperspectral image super-resolution, and compressive spectral imaging reconstruction. Unlike previous methods, the proposed approach, dubbed Mixture-Net, implicitly learns the prior information through the network. Mixture-Net consists of a deep generative model whose layers are inspired by linear and non-linear low-rank mixture models, where the recovered image is composed of a weighted sum of a linear and a non-linear decomposition. Mixture-Net also provides a low-rank decomposition that can be interpreted as the spectral image's abundances and endmembers, which is helpful for remote sensing tasks without running additional routines. The experiments show that Mixture-Net outperforms state-of-the-art methods in recovery quality, with the added advantage of architectural interpretability.
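The weighted linear/non-linear decomposition can be sketched in a few lines of NumPy; the elementwise non-linearity, the fixed weight, and the shapes are illustrative placeholders for what Mixture-Net learns end to end.

```python
# Linear + non-linear low-rank mixture of a spectral image (a sketch).
import numpy as np

H, W, L, R = 64, 64, 31, 6  # height, width, spectral bands, rank (endmembers)
abundances = np.random.rand(H * W, R)  # per-pixel mixing coefficients
endmembers = np.random.rand(R, L)      # spectral signatures
alpha = 0.8                            # linear/non-linear mixing weight

linear_part = abundances @ endmembers              # classical linear mixture
nonlinear_part = np.tanh(abundances) @ endmembers  # stand-in non-linearity
recovered = (alpha * linear_part
             + (1 - alpha) * nonlinear_part).reshape(H, W, L)
```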
Vision-based navigation requires processing complex information in order to make task-oriented decisions. Applications include autonomous robots, self-driving vehicles, and assistive vision for humans. One of the key elements in the process is the extraction and selection of relevant features in pixel space upon which an action can be chosen, a task for which machine learning techniques are well suited. However, deep reinforcement learning agents trained in simulation often show unsatisfactory results when deployed in the real world, because of perceptual differences known as the $\textit{reality gap}$. A method not yet explored for bridging this gap is self-attention. In this paper, we (1) perform a systematic exploration of self-attention-based navigation in 3D environments and the behaviors observed under different hyperparameter sets, including their generalization capabilities; (2) present strategies to improve the agents' generalization abilities and navigation behavior; and (3) show how models trained in simulation are capable of processing real-world images in real time. To our knowledge, this is the first demonstration of a self-attention-based agent successfully navigating a 3D action space using fewer than 4,000 parameters.
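To give a sense of how small such an agent can be, here is a patch-level self-attention module sketched in NumPy with a few hundred parameters; the patch size, key dimension, and importance heuristic are illustrative assumptions, not the paper's architecture.

```python
# Tiny patch-level self-attention for feature selection (a sketch).
import numpy as np

d_in, d_k = 27, 8  # e.g. flattened 3x3x3 image patches, key dimension
W_q = np.random.randn(d_in, d_k) * 0.1
W_k = np.random.randn(d_in, d_k) * 0.1  # 2 * 27 * 8 = 432 parameters total

def attend(patches: np.ndarray) -> np.ndarray:
    """patches: (N, d_in). Returns a per-patch importance score (N,) that
    can be used to select the regions fed to the action controller."""
    q, k = patches @ W_q, patches @ W_k
    scores = q @ k.T / np.sqrt(d_k)  # (N, N) attention matrix
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.sum(axis=0)         # column sums = how much each patch is attended to
```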